Data cleaning is the process of removing incorrect, duplicate, typos, or otherwise erroneous data from a dataset
Data wrangling changing the format to make it more useful for your analysis
See Import data BioST@TS tutorial
Exercise
Run the following commands to install the palmerpenguins package.
# A tibble: 6 × 17
studyName `Sample Number` Species Region Island Stage `Individual ID`
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 PAL0708 1 Adelie Penguin … Anvers Torge… Adul… N1A1
2 PAL0708 2 Adelie Penguin … Anvers Torge… Adul… N1A2
3 PAL0708 3 Adelie Penguin … Anvers Torge… Adul… N2A1
4 PAL0708 4 Adelie Penguin … Anvers Torge… Adul… N2A2
5 PAL0708 5 Adelie Penguin … Anvers Torge… Adul… N3A1
6 PAL0708 6 Adelie Penguin … Anvers Torge… Adul… N3A2
# ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
# `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
# `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
# `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
[1] "studyName" "Sample Number" "Species"
[4] "Region" "Island" "Stage"
[7] "Individual ID" "Clutch Completion" "Date Egg"
[10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
[13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
[16] "Delta 13 C (o/oo)" "Comments"
[1] "study_name" "sample_number" "species"
[4] "region" "island" "stage"
[7] "individual_id" "clutch_completion" "date_egg"
[10] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm"
[13] "body_mass_g" "sex" "delta_15_n_o_oo"
[16] "delta_13_c_o_oo" "comments"
Rows: 344
Columns: 17
$ study_name <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ delta_15_n_o_oo <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ delta_13_c_o_oo <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
| Name | penguins_clean |
| Number of rows | 344 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| Date | 1 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| study_name | 0 | 1.00 | 7 | 7 | 0 | 3 | 0 |
| species | 0 | 1.00 | 33 | 41 | 0 | 3 | 0 |
| region | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| island | 0 | 1.00 | 5 | 9 | 0 | 3 | 0 |
| stage | 0 | 1.00 | 18 | 18 | 0 | 1 | 0 |
| individual_id | 0 | 1.00 | 4 | 6 | 0 | 190 | 0 |
| clutch_completion | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| sex | 11 | 0.97 | 4 | 6 | 0 | 2 | 0 |
| comments | 290 | 0.16 | 18 | 68 | 0 | 10 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_egg | 0 | 1 | 2007-11-09 | 2009-12-01 | 2008-11-09 | 50 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| sample_number | 0 | 1.00 | 63.15 | 40.43 | 1.00 | 29.00 | 58.00 | 95.25 | 152.00 | ▇▇▆▅▃ |
| culmen_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.10 | 39.23 | 44.45 | 48.50 | 59.60 | ▃▇▇▆▁ |
| culmen_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.10 | 15.60 | 17.30 | 18.70 | 21.50 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.00 | 190.00 | 197.00 | 213.00 | 231.00 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.00 | 3550.00 | 4050.00 | 4750.00 | 6300.00 | ▃▇▆▃▂ |
| delta_15_n_o_oo | 14 | 0.96 | 8.73 | 0.55 | 7.63 | 8.30 | 8.65 | 9.17 | 10.03 | ▃▇▆▅▂ |
| delta_13_c_o_oo | 13 | 0.96 | -25.69 | 0.79 | -27.02 | -26.32 | -25.83 | -25.06 | -23.79 | ▆▇▅▅▂ |
Exercise
Run the following functions:
clean_names()
glimpse()
skimr::skim()
name items passes fails nNA error warning expression
1 culm 344 342 0 2 FALSE FALSE culmen_length_mm - 0 >= -1e-08
2 body 344 331 11 2 FALSE FALSE body_mass_g > 3000
3 species 1 1 0 0 FALSE FALSE is.character(species)
4 sex 344 168 165 11 FALSE FALSE sex %vin% c("FEMALES", "MALE")
dat <- penguins_clean |>
tidylog::select(study_name, species, island, individual_id, body_mass_g) |>
tidylog::filter(species %in% c("Adelie Penguin (Pygoscelis adeliae)"))select: dropped 12 variables (sample_number, region, stage, clutch_completion, date_egg, …)
filter: removed 192 rows (56%), 152 rows remaining
# A tibble: 4 × 2
Day Person
<chr> <chr>
1 Monday Becks
2 Tuesday Amy
3 Wednesday Matt
4 Wednesday Matt
Missing data is often a problem, e.g. for running a model. Typically as ecologists we sweep missing data under the carpet by using a “complete case” approach to data analysis.
If you have ever written some code like this:
you are removing missing data (NAs) from your dataset.
By throwing away potentially useful data (only including those rows without a NA in them) you reduce the information you are working with, reduce statistical power and introduce selection bias (invalidating any assumption of randomisation).
Data can be missing random or systematically.
Exercise
Use the penguins dataset to:
Select the columns species, island and body_mass_g
Filter for the species Adelie and the island Biscoe and Dream
Convert the penguin body mass from gram to kilogram
Calculate the mean bill length and standard error for each penguin species
Make a wide dataset showing the body mass for each penguin species in a separate column. And turn it back to a long dataframe.
Import the messy_data.RDS data file and clean it.